{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "0c6c50fc-15e1-4767-925a-53a37c430b9b",
      "metadata": {},
      "source": [
        "# How to load HTML\n",
        "\n",
        "The HyperText Markup Language or [HTML](https://en.wikipedia.org/wiki/HTML) is the standard markup language for documents designed to be displayed in a web browser.\n",
        "\n",
        "This covers how to load `HTML` documents into a LangChain [Document](https://api.js.langchain.com/classes/langchain_core.documents.Document.html) objects that we can use downstream.\n",
        "\n",
        "Parsing HTML files often requires specialized tools. Here we demonstrate parsing via [Unstructured](https://unstructured-io.github.io/unstructured/). Head over to the integrations page to find integrations with additional services, such as [FireCrawl](/docs/integrations/document_loaders/web_loaders/firecrawl).\n",
        "\n",
        ":::info Prerequisites\n",
        "\n",
        "This guide assumes familiarity with the following concepts:\n",
        "\n",
        "- [Documents](https://api.js.langchain.com/classes/_langchain_core.documents.Document.html)\n",
        "- [Document Loaders](/docs/concepts/document_loaders)\n",
        "\n",
        ":::\n",
        "\n",
        "## Installation\n",
        "\n",
        "```{=mdx}\n",
        "import Npm2Yarn from \"@theme/Npm2Yarn\"\n",
        "\n",
        "<Npm2Yarn>\n",
        "  @langchain/community @langchain/core\n",
        "</Npm2Yarn>\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "868cfb85",
      "metadata": {},
      "source": [
        "## Setup\n",
        "\n",
        "Although Unstructured has an open source offering, you're still required to provide an API key to access the service. To get everything up and running, follow these two steps:\n",
        "\n",
        "1. Download & start the Docker container:\n",
        "  \n",
        "```bash\n",
        "docker run -p 8000:8000 -d --rm --name unstructured-api downloads.unstructured.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0\n",
        "```\n",
        "\n",
        "2. Get a free API key & API URL [here](https://unstructured.io/api-key), and set it in your environment (as per the Unstructured website, it may take up to an hour to allocate your API key & URL.):\n",
        "\n",
        "```bash\n",
        "export UNSTRUCTURED_API_KEY=\"...\"\n",
        "# Replace with your `Full URL` from the email\n",
        "export UNSTRUCTURED_API_URL=\"https://<ORG_NAME>-<SECRET>.api.unstructuredapp.io/general/v0/general\" \n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a4d93b2e",
      "metadata": {},
      "source": [
        "## Loading HTML with Unstructured"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "id": "7d167ca3-c7c7-4ef0-b509-080629f0f482",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "[\n",
            "  Document {\n",
            "    pageContent: 'Word of the Day',\n",
            "    metadata: {\n",
            "      category_depth: 0,\n",
            "      languages: [Array],\n",
            "      filename: 'wordoftheday.html',\n",
            "      filetype: 'text/html',\n",
            "      category: 'Title'\n",
            "    }\n",
            "  },\n",
            "  Document {\n",
            "    pageContent: ': April 10, 2023',\n",
            "    metadata: {\n",
            "      emphasized_text_contents: [Array],\n",
            "      emphasized_text_tags: [Array],\n",
            "      languages: [Array],\n",
            "      parent_id: 'b845e60d85ff7d10abda4e5f9a37eec8',\n",
            "      filename: 'wordoftheday.html',\n",
            "      filetype: 'text/html',\n",
            "      category: 'UncategorizedText'\n",
            "    }\n",
            "  },\n",
            "  Document {\n",
            "    pageContent: 'foible',\n",
            "    metadata: {\n",
            "      category_depth: 1,\n",
            "      languages: [Array],\n",
            "      parent_id: 'b845e60d85ff7d10abda4e5f9a37eec8',\n",
            "      filename: 'wordoftheday.html',\n",
            "      filetype: 'text/html',\n",
            "      category: 'Title'\n",
            "    }\n",
            "  },\n",
            "  Document {\n",
            "    pageContent: 'play',\n",
            "    metadata: {\n",
            "      category_depth: 0,\n",
            "      link_texts: [Array],\n",
            "      link_urls: [Array],\n",
            "      link_start_indexes: [Array],\n",
            "      languages: [Array],\n",
            "      filename: 'wordoftheday.html',\n",
            "      filetype: 'text/html',\n",
            "      category: 'Title'\n",
            "    }\n",
            "  },\n",
            "  Document {\n",
            "    pageContent: 'noun',\n",
            "    metadata: {\n",
            "      category_depth: 0,\n",
            "      emphasized_text_contents: [Array],\n",
            "      emphasized_text_tags: [Array],\n",
            "      languages: [Array],\n",
            "      filename: 'wordoftheday.html',\n",
            "      filetype: 'text/html',\n",
            "      category: 'Title'\n",
            "    }\n",
            "  }\n",
            "]\n"
          ]
        }
      ],
      "source": [
        "import { UnstructuredLoader } from \"@langchain/community/document_loaders/fs/unstructured\";\n",
        "\n",
        "const filePath = \"../../../../libs/langchain-community/src/tools/fixtures/wordoftheday.html\"\n",
        "\n",
        "const loader = new UnstructuredLoader(filePath, {\n",
        "  apiKey: process.env.UNSTRUCTURED_API_KEY,\n",
        "  apiUrl: process.env.UNSTRUCTURED_API_URL,\n",
        "});\n",
        "\n",
        "const data = await loader.load()\n",
        "console.log(data.slice(0, 5));"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "TypeScript",
      "language": "typescript",
      "name": "tslab"
    },
    "language_info": {
      "codemirror_mode": {
        "mode": "typescript",
        "name": "javascript",
        "typescript": true
      },
      "file_extension": ".ts",
      "mimetype": "text/typescript",
      "name": "typescript",
      "version": "3.7.2"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}